Technical Field
This invention relates to audio-visual or multimedia
communication systems, and more particularly to a method of
animating a synthesized model of a human face driven by an audio
signal.



Background Art 

Interest in integrating natural or synthetic objects into
multimedia applications, to facilitate and increase
user-application interaction, is growing, and in this context the
use of anthropomorphic models, intended to facilitate the
man-machine relationship, is being envisaged. This interest has
recently been acknowledged also by international standardization
organizations. ISO/IEC standard 14496, "Generic Coding of
Audio-Visual Objects" (commonly known as the "MPEG-4 standard"
and hereinafter referred to as such), aims, among other things,
at establishing a general framework for such applications.

In such applications in general, regardless of the
specific solutions indicated in the MPEG-4 standard,
anthropomorphic models are conceived to assist other information
flows and are seen as objects which can be animated, where
animation is driven by audio signals such as speech. These
signals can also be considered as phonetic sequences, i.e. as
sequences of "phonemes", where a "phoneme" is the smallest
linguistic unit (corresponding to the idea of a distinctive sound
in a language).

In this case, animation systems able to deform the
geometry and the appearance of the models in synchrony with the
voice itself need to be developed, so that the synthetic faces
assume the typical expressions of speech. The final result
towards which development tends is a talking head, or face, which
appears as natural as possible.

The application contexts of animated models of this
kind range from Internet applications, such as welcome or on-line
help messages, to co-operative work applications (e.g. e-mail
browsers), to professional applications, such as the creation of
cinema or television post-production effects, to video games,
etc.

The models of human faces commonly used are, in
general, based on a geometrical representation consisting of a
three-dimensional mesh structure (known as a "wire-frame").
Animation is based on the application, in succession, of suitable
transforms to the polygons forming the wire-frame (or a
respective sub-set) to reproduce the required effect, i.e. in
this specific case, the movements related to speech.

The solution envisaged by the MPEG-4 standard for this
purpose describes the use of a set of facial animation
parameters, defined independently of the model, to ensure
interoperability of systems.

This set of parameters is organized on different
levels: the highest level consists of the so-called "visemes" and
"expressions", while the lowest level consists of the elementary
transforms permitting generic posture of the face. According to
the MPEG-4 standard, a viseme is the visual equivalent of one or
more similar phonemes.

In this invention, the term viseme is used to indicate
a shape of the face, associated with the utterance of a phoneme
and obtained by applying low-level MPEG-4 parameters; it does
not, therefore, refer to the high-level MPEG-4 parameters.

Various systems for animating facial models driven by
voice are known in the literature. For example, the following
documents can be cited: "Converting Speech into Lip Movements: A
Multimedia Telephone for Hard of Hearing People", by F.
Lavagetto, IEEE Transactions on Rehabilitation Engineering, Vol.
3, No. 1, March 1995; DIST, Genoa University, "Description of
Algorithms for Speech-to-Facial Movements Transformation", ACTS
"SPLIT" Project, November 1995; TUB, Technical University of
Berlin, "Analysis and Synthesis of Visual Speech Movements", ACTS
"SPLIT" Project, November 1995. These systems, however, do not
implement MPEG-4 standard compliant parameters and, for this
reason, are not very flexible.

An MPEG-4 standard compliant animation method is
described in Italian Patent Application no. TO98A000842 by the
Applicant. This method associates visemes, selected from a set
comprising the visemes defined by the MPEG-4 standard and visemes
specific to a particular language, to phonemes or groups of
phonemes. According to this method, visemes are split into a
group of macro parameters, characterizing the shape and/or
position of the labial area and of the jaw of the model, and are
associated to respective intensity values, representing the
deviation from a neutral position and ensuring adequate
naturalness of the animated model.

Furthermore, the macro parameters are split into the
low-level facial animation parameters defined in the MPEG-4
standard, to which intensity values linked to the macro parameter
values are also associated, ensuring adequate naturalness of the
animated model.

Said method can be used for different languages and
ensures adequate naturalness of the resulting synthetic model.
However, the method is not based on the analysis of motion data
tracked on the face of a real speaker. For this reason, the
animation result is not very realistic or natural.

Disclosure of the Invention
The method according to this invention is not language
dependent and makes the animated synthetic model more natural,
because it is based on a simultaneous analysis of the voice and
of the movements of the face, tracked on real speakers.

The method of animating a synthesized model of a human
face driven by an audio signal, according to the invention,
comprises an analytic phase, in which
- an alphabet of low-level visemes is determined, and a
synthesis phase, in which
- the audio driving signal is converted into a sequence
of low-level visemes applied to a model,

wherein said analytic phase comprises the steps of
- extracting both a set of information representing the
shape of a speaker's face and corresponding sequences of phonetic
units from a set of audio training signals;

- compressing said set of information into active shape
model (ASM) parameter vectors;

- associating to said active shape model (ASM)
parameter vectors representative of phonetic units an
interpolation function to provide a continuous representation of
movement between phonemes, wherein said interpolation function is
a convex combination having combination coefficients variable as
a continuous function of time, whereby said association
determines said alphabet of low-level visemes;

- associating low-level parameters of facial animation,
compliant with Standard ISO/IEC 14496 VER. 1, to said low-level
visemes;

wherein said synthesis phase comprises the steps of

- extracting a sequence of phonetic units of an audio
driving signal;

- associating to said sequence of phonetic units a
corresponding sequence of low-level visemes as determined in the
analytic phase;

- transforming said sequence of low-level visemes
through an interpolation function to provide a continuous
representation of movement between phonemes, wherein said
interpolation function is a convex combination having combination
coefficients variable as a continuous function of time; and

wherein the combination coefficients used in the
synthesis phase are the same as those used in the analytic phase.



The combination coefficients $B_n(t)$ of the convex
combinations can be functions of the following type:

[Equation not legible in the source text: a piecewise definition
of $B_n(t)$ with cosine-based branches and a zero branch.]

where $t_n$ is the instant of utterance of the nth phonetic unit.

The wire-frame vertices corresponding to the model
feature points, on the basis of which facial animation parameters
are determined in the analytic phase, can be identified, and the
low-level viseme interpolation operations are conducted by
applying transforms to the feature points for each low-level
viseme, for animating a wire-frame based model.

For each position to be assumed by the model in the
synthesis phase, the transforms can be applied only to the
vertices of the wire-frame corresponding to the feature points,
and the transforms are extended to the remaining vertices by
means of a convex combination of the transforms applied to the
vertices of the wire-frame corresponding to the feature points.

The low-level visemes can be converted into
co-ordinates of the feature points of the face of the speaker,
followed by conversion of said co-ordinates into said low-level
facial animation parameters compliant with Standard ISO/IEC 14496
VER. 1.

The low-level facial animation parameters, representing
the co-ordinates of feature points, can be obtained in the
analytic phase by analyzing the movements of a set of markers (7)
which identify the feature points themselves.

The data representing the co-ordinates of the feature
points of the face are normalized according to the following
method:

- a sub-set of markers is associated to a stiff object
(8) applied to the forehead of the speaker;

- the face of the speaker is set, at the beginning of the
recording, to assume a position corresponding as far as possible
to the position of a neutral face model, as defined in standard
ISO/IEC 14496 Rev. 1, and a first frame of the face in such
neutral position is obtained;

- for all frames subsequent to the first frame, the sets of
co-ordinates are rotated and translated so that the co-ordinates
corresponding to the markers of said sub-set coincide with the
co-ordinates of the markers of the same sub-set in the first
frame.

The invention also relates to a method of generating an
alphabet of low-level visemes for animating a synthesized model
of a human face driven by an audio signal, comprising the steps
of

- extracting both a set of information representing the
shape of a speaker's face and corresponding sequences of phonetic
units from a set of audio training signals;

- compressing said set of information into active shape
model (ASM) parameter vectors;

- associating to said active shape model (ASM)
parameter vectors representative of phonetic units an
interpolation function to provide a continuous representation of
movement between phonemes, wherein said interpolation function is
a convex combination having combination coefficients variable as
a continuous function of time, whereby said association
determines said alphabet of low-level visemes.

The combination coefficients $B_n(t)$ of the convex
combinations can be functions of the following type:

[Equation not legible in the source text.]

where $t_n$ is the instant of utterance of the nth phonetic unit.

The use of the so-called "Active Shape Models" (ASM,
the acronym which will be used hereinafter) to animate a facial
model guided by voice is suggested in the documents "Conversion
of articulatory parameters into active shape model coefficients
for lip motion representation and synthesis", S. Lepsoy and S.
Curinga, Image Communication 13 (1998), pages 209-225, and
"Active shape models for lip motion synthesis", S. Lepsoy,
Proceedings of the International Workshop on Synthetic-Natural
Hybrid Coding and Three Dimensional Imaging (IWSNHC3DI 97),
Rhodes (Greece), September 1997, pages 200-203, which
specifically deal with the problem of motion representation
conversion. The active shape model method is a technique for
representing the distribution of points in space, which is
particularly useful for describing faces and other transformable
objects by means of a few parameters. These active shape models
consequently permit a reduction in the quantity of data. This is
the property which will be exploited for the purpose of this
invention.

Further details on active shape model theory can be
found, for example, in the document by T. F. Cootes, D. Cooper,
C. J. Taylor and J. Graham, "Active Shape Models - Their Training
and Application", Computer Vision and Image Understanding, Vol.
61, no. 1, Jan. 1995, pages 38-59.

Brief Description of Drawings 
Reference is made to the following drawings for further
clarification, wherein:

FIG. 1 shows three pictures of a human face model: a
wire-frame only picture on the left; a picture with homogeneous
coloring and shading in the middle; a picture with added
texturing on the right;

FIG. 2 is a flow chart illustrating the analytic
operations associating the language-specific phonetic data and
the respective movements of the human face;

FIG. 3 shows an example of phonetic alignment;

FIG. 4 illustrates the set of markers used during a
generic motion tracking session;

FIG. 5 is a flow chart illustrating the synthesis
operations that convert the phonetic flow of a text used for
driving the true facial model animation; and

FIG. 6 illustrates an example of model animation.

Best Mode for Carrying Out the Invention
The following generic premises must be made before
describing the invention in detail.

Animation is driven by phonetic sequences in which the
instant of time when each phoneme is uttered is known. This
invention describes an animation method which is not language
dependent: this means that the sequence of operations to be
followed is the same for each language for which the movement of
speech is to be reproduced. This invention permits the
association of the respective movements of the human face to the
phonetic data which is specific to a language. Such movements
are obtained by means of statistical analysis, providing very
realistic animation effects. In practice, given the case of a
model obtained on the basis of a wire-frame, animation consists
in applying to the vertices of the wire-frame a set of movements,
created as movements relative to a basic model representing an
inexpressive or neutral face, as defined in the MPEG-4 standard.
These relative movements are the result of a linear combination
of certain basic vectors, called auto-transforms. One part of the
analysis, described below, will be used to find a set of such
vectors. Another part will be used to associate a transform,
expressed in terms of low-level animation parameters - the
so-called FAPs (Facial Animation Parameters), defined in the
MPEG-4 standard - to each phoneme.

The animation, or synthesis, phase will then consist in
transforming the sequence of visemes, corresponding to the
phonemes in the specific driving text, into the sequence of
movements for the vertices of the wire-frame on which the model
is based.

A human face model, created on the basis of a wire-frame
structure, is shown in figure 1 to facilitate the comprehension
of the following description. Number 1 indicates the wire-frame
structure, number 2 is associated to the texture (i.e. to a
surface which fills the wire-frame, passing through the vertices
of the wire-frame itself) and number 3 indicates the model
completed with the picture of a real person. The method of
creating a model on the basis of the wire-frame is not part of
this invention and will not be further described herein. An
example of the process related to this creation is described by
the Applicant in Italian patent application no. TO98A000828.

Figure 2 illustrates in greater detail the analytic
phase of the process according to this invention.

A speaker 4 utters, in one or more sessions, the
phrases of a set of training phrases and, while the person
speaks, both the voice and the facial movements are recorded by
means of suitable sound recording devices 5 and television
cameras 6. At the same time, a phonetic transcription of the
uttered texts is made to obtain the phonemes present in the text.

The voice recording devices can be analogue or digital
devices providing adequate quality to permit subsequent phonetic
alignment, i.e. to permit the identification of the instants of
time in which the various phonemes are uttered.

This means that the temporal axis is split into
intervals, so that each interval corresponds to the utterance of
a certain phoneme ("Audio segmentation" step in figure 2). An
instant is associated to each interval, namely the instant in
which the phoneme is subjected to the minimal influence of the
adjacent phonemes. Hereinafter, whenever reference is made to a
temporal instant linked to a phoneme, the instant described above
will be understood.

Reference can be made to figure 3 and to Table 1 below,
both pertaining to the phonetic analysis and phonetic
transcription, with respective timing, of the phrase "Un
trucchetto geniale gli valse l'assoluzione", to clarify the
concept of phonetic alignment.



TABLE 1

Phoneme   Time
#         0.014000
u         0.077938
n         0.166250
t         0.216313
r         0.246125
u         0.296250
k:        0.431375
'e        0.521872
t:        0.619250
o         0.695438
Dg        0.749188
e         0.811375
n         0.858938
j         0.920625
'a        1.054101
l         1.095313
e         1.153359
Gl        1.254000
i         1.288125
v         1.339656
'a        1.430313
l         1.464000
s         1.582188
e         1.615688
l         1.654813
a         1.712982
s:        1.840000
o         1.873063
l         1.899938
u         1.966375
Ts:       2.155938
j         2.239875
'o        2.364250
n         2.416875
e         2.606188
@         2.617500

Voice and movement are recorded in a synchronized
fashion. Consequently, phonetic alignment provides the
information on which phoneme was uttered in each frame. This
information permits estimation of the geometric equivalent of the
face for each phoneme of the alphabet.
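
As a minimal sketch of how the alignment can label the recorded
frames, the following Python fragment assigns to each video frame
the phoneme whose associated instant is nearest in time. The
nearest-instant rule, the rate of 25 frames per second and the
name label_frames are assumptions made for illustration only; the
data are the first entries of Table 1.

```python
from bisect import bisect_left

# First entries of Table 1 as (phoneme, instant) pairs, sorted by time.
ALIGNMENT = [("#", 0.014000), ("u", 0.077938), ("n", 0.166250),
             ("t", 0.216313), ("r", 0.246125), ("u", 0.296250)]


def label_frames(alignment, frame_rate, n_frames):
    """Label each frame with the phoneme nearest in time (assumed rule)."""
    times = [t for _, t in alignment]
    labels = []
    for k in range(n_frames):
        t = k / frame_rate                  # time stamp of frame k
        i = bisect_left(times, t)           # first instant not before t
        if i == 0:
            best = 0
        elif i == len(times):
            best = len(times) - 1
        else:
            # choose the closer of the two neighbouring instants
            best = i if times[i] - t < t - times[i - 1] else i - 1
        labels.append(alignment[best][0])
    return labels


# Hypothetical frame rate; the source does not state one for the recording.
labels = label_frames(ALIGNMENT, frame_rate=25.0, n_frames=8)
print(labels)
```

Grouping the marker co-ordinates of the frames by such labels
gives, for each phoneme, the set of observed face shapes from
which its geometric equivalent can be estimated.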

Again with reference to figure 2 and considering the
recording of facial movements, this recording is advantageously
obtained by means of the "motion tracking" technique, which
permits very plausible animation based on examination of the
movements of a set of markers located at significant facial
features, e.g. the corners of the eyes, the edge of the lips and
of the face. These markers are indicated with number 7 in figure
4. The points selected for the markers will be called "landmarks"
or "feature points". The markers are generally small objects, the
spatial position of which can be detected by means of optical or
magnetic devices.

The motion tracking technique is well known in the
sector and does not require further explanation herein. A
certain number of phrases, at least one hundred, need to be
recorded for each language, to obtain a significant set of data.
Consequently, due to the limitations of the internal storage
capacity of motion tracking devices and to errors in phrase
reading, the recording should preferably be carried out in
several sessions, each of which will be dedicated to one or more
phrases.

The data obtained by tracking the motion of markers 7
consist of a set of co-ordinates which are not suitable for
direct analysis, for several reasons. Differences in the position
of the subject will result if several shooting sessions are
carried out. Furthermore, the inevitable head movements must be
removed from the data. The objective is to model the movements
relative to a neutral posture of the face and not the absolute
movements. Some aspects will also depend on the devices employed.
Errors in recorded data may occur, such as sudden movements and
the disappearance of some markers for a certain time. These
errors require a correction phase in order to obtain reliable
data. In other words, correction and normalisation of raw data is
required.

For this purpose, at the beginning of each recording,
the speaker's face must assume, as far as possible, the neutral
position of the face defined in the MPEG-4 standard.
Normalization (or training data cleaning) consists in aligning a
set of points, corresponding to markers 7, with the respective
feature points in a generic model of a neutral face. The spatial
orientation, position and dimension of this facial model are
known. The parameters of this transformation are computed on the
basis of the first frame in the recording. The reference to a
frame in the sequence is required because the markers 7 may not
be in the same position in different recordings. This operation
is carried out for each recorded sequence.

In practice, a certain number of markers, e.g. three,
used for the recording lie on a stiff object which is applied to
the forehead (the object indicated with number 8 in figure 4) and
are used to nullify the inevitable movements of the subject's
entire head during recording. As an example, for the sake of
simplicity, we can suppose that the first three markers are used.
Consequently, the sets of co-ordinates are rotated and translated
for all frames subsequent to the first in a sequence, so that the
first three markers coincide with the corresponding markers in
the first frame.
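
The rotation and translation described above can be sketched as
follows. This is only an illustration under stated assumptions:
the three forehead markers are taken to be perfectly rigid and
non-collinear, so that an orthonormal frame can be attached to
them directly; with noisy data a least-squares rigid fit would be
a more robust choice. The function names are hypothetical.

```python
import math


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return [c / norm for c in a]


def frame_of(m0, m1, m2):
    """Orthonormal axes attached to three non-collinear rigid markers."""
    e1 = _unit(_sub(m1, m0))
    e3 = _unit(_cross(e1, _sub(m2, m0)))
    e2 = _cross(e3, e1)
    return [e1, e2, e3]


def remove_head_motion(points, ref_markers, cur_markers):
    """Move the points of the current frame into the head pose of the first frame."""
    ref_axes = frame_of(*ref_markers)
    cur_axes = frame_of(*cur_markers)
    aligned = []
    for p in points:
        d = _sub(p, cur_markers[0])
        local = [_dot(d, axis) for axis in cur_axes]   # head-fixed co-ordinates
        aligned.append([ref_markers[0][i]
                        + sum(local[j] * ref_axes[j][i] for j in range(3))
                        for i in range(3)])
    return aligned


# Toy data: the head is rotated by 90 degrees about z and shifted by (2, 3, 4).
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cur = [(2.0, 3.0, 4.0), (2.0, 4.0, 4.0), (1.0, 3.0, 4.0)]
restored = remove_head_motion([(1.5, 3.5, 3.0)], ref, cur)
print(restored)
```

Applied to the forehead markers themselves, the same function
returns their positions in the first frame, which is the
coincidence condition stated above.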

After this operation, the first three markers are no
longer used. Furthermore, the positions of the feature points on
the real face in each picture will need to coincide to the
greatest possible extent with the positions of the model chosen
as the neutral face; this entails scaling the recorded picture to
adapt it to the dimensions of the model, and translating it. As
mentioned, the first three markers are no longer used for this
phase.

In order to handle a larger quantity of movement data
(and, for some embodiments, also to reduce the quantity of data
to be transmitted), a compressed representation of the movements
must be found. This compression exploits the fact that movement
in various areas of the face is correlated: consequently,
according to this invention, the numeric representation of the
movements is compressed and expressed, as mentioned above, as
combinations of a few basic vectors, called auto-transforms. The
auto-transforms must allow the closest possible approximation of
the facial movements contained in the recorded and transformed
sequence. It is emphasised that the movements treated herein are
relative to a neutral posture. The objective of compression is
reached by means of principal component analysis (PCA), a
constituent part of ASM. The principal components resulting from
this analysis are identical to the auto-transforms and have the
same meaning in the invention.
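
To illustrate how a principal component captures correlated
movement, the sketch below extracts only the dominant component
of a small set of samples by power iteration on the covariance
matrix. The choice of power iteration, the toy data and the name
first_principal_component are assumptions made for illustration;
a full analysis would extract several components, typically with
a numerical library.

```python
import math


def first_principal_component(samples, iterations=200):
    """Return the mean and the unit vector of the dominant direction of variation."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    centred = [[s[j] - mean[j] for j in range(dim)] for s in samples]
    # Covariance matrix of the centred samples.
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(dim)]
           for a in range(dim)]
    vec = [1.0] * dim   # arbitrary start, assumed not orthogonal to the result
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return mean, vec


# Toy data: the second co-ordinate always moves twice as much as the first,
# the third never moves, so a single vector describes all the movement.
samples = [[t, 2.0 * t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, component = first_principal_component(samples)
print(component)
```

Here three correlated co-ordinates are summarised by one
coefficient along the returned vector, which is the kind of
reduction the compression step relies on.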

The posture of the face (i.e. the positions of feature
points) assumed during speech can be approximated with a certain
accuracy as a linear combination of auto-transforms. These
linear combinations offer a representation of visemes expressed
as positions of feature points (by means of lower level
parameters). The coefficients of the linear combination are
called ASM parameters. Summarizing, a vector $x$, containing the
co-ordinates of feature points, is the result of a transform with
respect to a neutral face, with co-ordinates in a vector
$\bar{x}$, by means of the sum $x = \bar{x} + Pv$, where $P$ is a
matrix containing the auto-transforms as columns and $v$ is a
vector of ASM parameters.
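
The sum of the neutral face and a combination of auto-transforms
can be sketched in a few lines. The example assumes that the
auto-transforms are orthonormal, so that the ASM parameters of a
given shape are obtained by projecting its displacement onto each
of them; the names asm_encode and asm_decode and the numbers are
hypothetical.

```python
def asm_encode(x, neutral, basis):
    """ASM parameters of shape x: projection of (x - neutral) on each auto-transform."""
    diff = [xi - ni for xi, ni in zip(x, neutral)]
    return [sum(p[i] * diff[i] for i in range(len(diff))) for p in basis]


def asm_decode(v, neutral, basis):
    """Feature point co-ordinates: neutral face plus the weighted auto-transforms."""
    x = list(neutral)
    for coeff, p in zip(v, basis):
        for i in range(len(x)):
            x[i] += coeff * p[i]
    return x


# Toy case: two 2-D feature points (four co-ordinates), one auto-transform.
neutral = [0.0, 0.0, 1.0, 0.0]
basis = [[0.0, 0.6, 0.0, 0.8]]          # unit-length auto-transform (assumed)
shape = [0.0, 1.2, 1.0, 1.6]            # neutral + 2 * auto-transform

params = asm_encode(shape, neutral, basis)
rebuilt = asm_decode(params, neutral, basis)
print(params, rebuilt)
```

In this toy case four co-ordinates are represented by a single
parameter, and decoding restores the original shape.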

The ASM model permits expression of the posture assumed
by the face during motion tracking by means of a vector
consisting of a few parameters. By way of example, the
co-ordinates of 41 markers can be approximated with satisfactory
results using 10 ASM parameters. Furthermore, these operations
suppress a component of noise inherent to the acquisition system,
i.e. one which is not correlated to facial movement.

The low-level viseme calculation phase follows, after
collecting voice and movement information.

The objective of this phase is to determine a vector of
ASM parameters associated to each single phoneme, i.e. the
viseme. The basic criterion is to create a synthesis (i.e.
animation) which best approximates the recorded movement. It
is important to stress that this criterion is adopted in the
invention to estimate the parameters used in the synthesis phase;
this means that it is possible to reproduce the movement of any
phrase, not only the phrases belonging to the set recorded during
motion tracking. The animation, as mentioned, is guided by
phonemes, which are associated to the respective temporal
instants. If the visemes associated to the individual phonemes of
an animation driving text were used directly, a very
discontinuous representation of movement, corresponding to the
instants of time associated to the phonemes, would result. In
practice, the movement of the face is a continuous phenomenon
and, consequently, contiguous visemes must be interpolated to
provide a continuous (and consequently more natural)
representation of motion.

Interpolation is a convex combination of the low-level
visemes to be computed, in which the coefficients of the
combination (weights) are defined as functions of time. Note that
a linear combination is said to be convex when all coefficients
are in the [0, 1] interval and their sum is equal to 1. The
interpolation coefficients generally have a value other than zero
only in a small interval surrounding the instant of utterance,
where the coefficient value reaches its maximum. In the case in
which the interpolation is required to pass through the low-level
visemes (forming the interpolation nodes), all coefficients must
be equal to zero at the temporal instant of a certain phoneme,
except for that of the specific low-level viseme, which must be
equal to one.

An example of a function which can be used for the
coefficients follows:

[Equation not legible in the source text.]

where $t_n$ is the instant of utterance of the nth phoneme.
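
A minimal sketch of coefficients with the properties stated above
is given below. The raised-cosine (cos squared) window between
adjacent instants is an assumption, chosen because it yields a
convex combination that passes through the nodes; it is not
presented as the function of the invention. The name
convex_weights and the instants are hypothetical.

```python
import math


def convex_weights(t, instants):
    """Weight of each low-level viseme at time t (assumed cos^2 blending)."""
    weights = [0.0] * len(instants)
    if t <= instants[0]:
        weights[0] = 1.0
    elif t >= instants[-1]:
        weights[-1] = 1.0
    else:
        for k in range(len(instants) - 1):
            t0, t1 = instants[k], instants[k + 1]
            if t0 <= t < t1:
                phase = 0.5 * math.pi * (t - t0) / (t1 - t0)
                weights[k] = math.cos(phase) ** 2       # fades out after t0
                weights[k + 1] = math.sin(phase) ** 2   # fades in towards t1
                break
    return weights


instants = [0.0, 1.0, 3.0]               # hypothetical instants of utterance
at_node = convex_weights(1.0, instants)
between = convex_weights(0.5, instants)
print(at_node, between)
```

At an instant of utterance only the weight of the corresponding
viseme is non-zero; between two instants the two neighbouring
weights are positive and sum to one.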

The operations described hereinafter are used to
satisfy the criterion of approximating the recorded movement with
the synthesised movement. The low-level viseme vectors can be
grouped in rows forming a matrix $V$. The coefficients of the
convex combination can in turn be grouped in a row vector $c$.
The convex combination of visemes is consequently formed by the
product $cV$. The vector of the coefficients is a function of
time, and a matrix $C$ can be formed in which each row contains
the coefficients at an instant in time. For the analysis, the
instants for which motion tracking data exist are selected. The
product $CV$ contains rows of ASM vectors which can approximate
the natural movement contained in the tracking data. The purpose
of this step is to determine the elements of the matrix $V$
containing the low-level visemes, so as to minimise the gap
between the natural movement (that of the observed frames) and
the synthesised movement.

Advantageously, the mean square distance between the
rows of the product $CV$ and the ASM vectors representing the
recorded movement is minimised, as defined by the Euclidean
norm.
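
One way to carry out this minimisation is an ordinary
least-squares fit. The sketch below solves the normal equations
by Gauss-Jordan elimination; this solver, the assumption that the
coefficient matrix has full column rank, the name solve_visemes
and the numbers are illustrative choices, not taken from the
source.

```python
def solve_visemes(C, A):
    """Least-squares V minimising ||C V - A||^2 via (C^T C) V = C^T A."""
    n_frames, n_vis, n_par = len(C), len(C[0]), len(A[0])
    # Augmented system [C^T C | C^T A].
    M = [[sum(C[f][i] * C[f][j] for f in range(n_frames)) for j in range(n_vis)]
         + [sum(C[f][i] * A[f][p] for f in range(n_frames)) for p in range(n_par)]
         for i in range(n_vis)]
    for col in range(n_vis):
        # Partial pivoting, then elimination of the column in all other rows.
        pivot = max(range(col, n_vis), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n_vis):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [[M[i][n_vis + p] / M[i][i] for p in range(n_par)]
            for i in range(n_vis)]


# Toy data: four observed frames, two visemes, two ASM parameters each.
C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]   # convex weights per frame
A = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0], [0.25, 1.5]]    # recorded ASM vectors
V = solve_visemes(C, A)
print(V)
```

Each row of the result is the ASM vector of one low-level viseme;
in this noise-free toy case the fit reproduces the visemes
[1, 0] and [0, 2] from which the frames were generated.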

After computing the low-level visemes, the following
step consists in passing from the compressed representation,
obtained by means of the operations described above, to a
position in space of the feature points defined in the MPEG-4
standard. Considering that the computed low-level visemes are
vectors containing ASM coefficients, the conversion can be
obtained by means of a simple matrix product, as described in
active shape model theory. A vector containing the feature point
transform is obtained by multiplying the auto-transform matrix
by the ASM vector (as a column).

In turn, the low-level facial animation parameters
express the position of feature points relative to an
inexpressive face. Consequently, the translation of visemes,
represented as positions of feature points, into these low-level
parameters is immediate.

After performing the operations described above on all
the phrases of the training set, the table linking the low-level
facial animation parameters to the phonemes, which will then be
used in the synthesis (or animation) phase, is built.

Reference is now made to the chart in figure 5,
illustrating the operations related to synthesis or animation of
the model starting from a given driving text.

"Synthesis" herein means computing movements for a wire-frame on
the basis of phonetic and temporal information, so that the
transforms are synchronised with the associated sounds and
closely reproduce lip movement. Synthesis is, consequently, the
process which converts a sequence of visemes into a sequence of
wire-frame co-ordinates, representing the face to be animated.
Synthesis is based on the correspondence table between phonemes
and low-level MPEG-4 FAPs, resulting from the analysis process.
Consequently, the animation process takes as inputs the
wire-frame to be animated, the phonemes contained in the phrase
to be reproduced and the table linking phonemes to low-level
FAPs. The wire-frame is specified by a set of points in space, by
a set of polygons which use those points as vertices and by
information inherent to the appearance of the surface, such as
colour and texture.

To reproduce a given driving signal (generally, a
phrase), the phrase must first be transcribed as a sequence of
phonemes, each of which is labelled with the instant in time at
which it was uttered, as shown in the example in Table 1. A
discrete sequence of low-level visemes corresponds to this
discrete sequence. The sequence of phonemes can be obtained in
different ways, according to the source of the phrase to be
reproduced. In the case of synthesised sound, in addition to
generating the speech waveform, the synthesiser will generate the
phonetic transcription and the respective time reference. In the
case of natural voice, this information must be extracted from
the audio signal. Typically, this operation can be carried out in
two different ways, according to whether the phonemes contained
in the uttered phrase are known or not. The first case is called
"phonetic alignment" and the second case is called "phonetic
recognition", which generally provides lower quality results.
These procedures are all known in the literature and are not the
subject of this invention.

To ensure the naturalness and fluidity of movement of
the animated face, a high number of pictures or frames per second
(e.g. at least 16 frames) is required. This number is
considerably higher than the number of phonemes contained in the
driving signal. Consequently, numerous intermediate movements of
the face between two subsequent phonemes will need to be
determined, as shown in greater detail below.

With reference to the creation of a single frame, it is
stressed that facial animation parameters are taken from feature
points. For this reason, it must be known which vertices in the
wire-frame correspond to the considered feature points. This
information is obtained by means of a method which is similar to
that used in the analytic phase, i.e. by multiplying the
coefficient vector related to the principal components by the
principal component matrix. In this way, the FAPs are transformed
into movements of the vertices. Considering that the MPEG-4
standard specifies that the wire-frame should have a predefined
spatial orientation, the transformation of FAPs into movements is
immediate, given that the FAPs are specified in units of measure
related to the dimension of the face.

The model reproducing the face comprises, in general, a
number of vertices which is much higher than the number of
feature points. The movement of feature points must be
extrapolated to obtain a defined movement of all vertices. The
motion of each vertex not associated to a feature point will be a
convex combination of the movements of feature points. The
relative coefficients are calculated on the basis of the distance
between the vertex to be moved and each of the feature points,
and for this purpose the minimum length of the path along the
arcs of the wire-frame, known as Dijkstra's distance, is used
(E. Dijkstra, "A note on two problems in connection with graphs",
Numerische Mathematik, vol. 1, p. 269-271, Springer Verlag,
Berlin, 1959). The contribution provided by a feature point to a
vertex is inversely proportional to the nth power of Dijkstra's
distance between the two points. This power is determined with
the objective of giving greater importance to feature points
close to the vertex to be moved and is independent of the
dimension of the wire-frame.
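
The weighting described above can be sketched as follows: the
shortest path lengths along the wire-frame are computed with
Dijkstra's algorithm, and the inverse powers of these distances
are normalised so that they form a convex combination. The value
of the power (2), the adjacency-list layout and the function
names are assumptions made for illustration.

```python
import heapq


def dijkstra(adjacency, source):
    """Shortest path lengths from source along the arcs of the wire-frame."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue                      # outdated queue entry
        for v, length in adjacency[u]:
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return dist


def vertex_weights(adjacency, feature_points, vertex, power=2):
    """Convex weights of the feature points for one vertex, proportional to 1/d^power."""
    raw = {}
    for fp in feature_points:
        d = dijkstra(adjacency, fp)[vertex]
        if d == 0.0:                      # the vertex is itself a feature point
            return {f: (1.0 if f == fp else 0.0) for f in feature_points}
        raw[fp] = 1.0 / d ** power
    total = sum(raw.values())
    return {fp: w / total for fp, w in raw.items()}


# Toy wire-frame: vertex "v" lies between feature points "a" and "b".
adjacency = {"a": [("v", 1.0)],
             "v": [("a", 1.0), ("b", 2.0)],
             "b": [("v", 2.0)]}
weights = vertex_weights(adjacency, ["a", "b"], "v")
print(weights)
```

The displacement of the vertex is then the sum of the feature
point displacements multiplied by these weights; the nearer
feature point dominates.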

The latter operation results in a representation of the
low-level viseme on the entire wire-frame. The method described
above has the advantage that all feature points act on all
vertices, and therefore the specification of a sub-set of such
points for each vertex to be moved is no longer required. This
eliminates a work phase which would otherwise have to be carried
out manually and is, consequently, extremely expensive,
considering the high number of vertices in wire-frames even in
the case of relatively simple models.

Figure 6 shows how the visemes corresponding to the
phonemes a, m, p:, u (EURO-MPPA phonetic symbols) in the Italian
language are expressed by altering the structure of an entire
textured wire-frame.

As previously mentioned, temporal evolution must be
considered for synthesising a phrase. The starting point is the
sequence of known low-level visemes at discrete instants. In
order to use any frame rate, variable or not, at will, the
movement of the model is represented as a continuous function of
time. The representation as a continuous function of time is
obtained by the interpolation of low-level visemes, achieved in a
similar fashion to that described for the analytic phase.

A scaling acting as a coefficient in a convex
combination is associated to each low-level viseme; this
coefficient is a continuous function of time and is computed
according to the interpolation function previously used in the
analytic phase for computing the low-level visemes. For reasons
of efficiency, the computation is preferably carried out by
interpolation, and the number of feature points is lower than the
number of vertices. The continuous representation can be sampled
at will to obtain the individual frames which, shown in sequence
and synchronised with sound, reproduce an animation on a
computer.
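
Sampling the continuous representation at a constant frame rate
can be sketched as below. The cos squared blending of the two
neighbouring visemes is an assumed interpolation function, and
the names blend and sample_animation, the instants and the
one-parameter visemes are hypothetical.

```python
import math


def blend(t, instants, visemes):
    """Viseme vector at time t (assumed cos^2 blend of the two neighbouring nodes)."""
    if t <= instants[0]:
        return list(visemes[0])
    if t >= instants[-1]:
        return list(visemes[-1])
    k = max(i for i, ti in enumerate(instants) if ti <= t)
    phase = 0.5 * math.pi * (t - instants[k]) / (instants[k + 1] - instants[k])
    w0, w1 = math.cos(phase) ** 2, math.sin(phase) ** 2
    return [w0 * a + w1 * b for a, b in zip(visemes[k], visemes[k + 1])]


def sample_animation(instants, visemes, frame_rate):
    """Sample the continuous movement at a constant frame rate."""
    n_frames = int((instants[-1] - instants[0]) * frame_rate) + 1
    return [blend(instants[0] + k / frame_rate, instants, visemes)
            for k in range(n_frames)]


# Three phonemes, one parameter per viseme, four frames per time unit.
frames = sample_animation([0.0, 0.5, 1.0], [[0.0], [1.0], [0.0]], frame_rate=4)
print(frames)
```

Because the movement is defined for every instant, a different or
variable frame rate only changes the sampling times, not the
visemes or the interpolation.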

The description herein is provided as a non-limiting
example and obviously variations and changes are possible within
the scope of protection of this invention.